33 research outputs found

    Adaptive Algorithms for Automated Processing of Document Images

    Get PDF
    Large scale document digitization projects continue to motivate interesting document understanding technologies such as script and language identification, page classification, segmentation and enhancement. Typically, however, solutions are still limited to narrow domains or regular formats such as books, forms, articles or letters and operate best on clean documents scanned in a controlled environment. More general collections of heterogeneous documents challenge the basic assumptions of state-of-the-art technology regarding quality, script, content and layout. Our work explores the use of adaptive algorithms for the automated analysis of noisy and complex document collections. We first propose, implement and evaluate an adaptive clutter detection and removal technique for complex binary documents. Our distance transform based technique aims to remove irregular and independent unwanted foreground content while leaving text content untouched. The novelty of this approach is in its determination of best approximation to clutter-content boundary with text like structures. Second, we describe a page segmentation technique called Voronoi++ for complex layouts which builds upon the state-of-the-art method proposed by Kise [Kise1999]. Our approach does not assume structured text zones and is designed to handle multi-lingual text in both handwritten and printed form. Voronoi++ is a dynamically adaptive and contextually aware approach that considers components' separation features combined with Docstrum [O'Gorman1993] based angular and neighborhood features to form provisional zone hypotheses. These provisional zones are then verified based on the context built from local separation and high-level content features. Finally, our research proposes a generic model to segment and to recognize characters for any complex syllabic or non-syllabic script, using font-models. This concept is based on the fact that font files contain all the information necessary to render text and thus a model for how to decompose them. Instead of script-specific routines, this work is a step towards a generic character and recognition scheme for both Latin and non-Latin scripts

    The IDENTIFY study: the investigation and detection of urological neoplasia in patients referred with suspected urinary tract cancer - a multicentre observational study

    Get PDF
    Objective To evaluate the contemporary prevalence of urinary tract cancer (bladder cancer, upper tract urothelial cancer [UTUC] and renal cancer) in patients referred to secondary care with haematuria, adjusted for established patient risk markers and geographical variation. Patients and Methods This was an international multicentre prospective observational study. We included patients aged ≄16 years, referred to secondary care with suspected urinary tract cancer. Patients with a known or previous urological malignancy were excluded. We estimated the prevalence of bladder cancer, UTUC, renal cancer and prostate cancer; stratified by age, type of haematuria, sex, and smoking. We used a multivariable mixed-effects logistic regression to adjust cancer prevalence for age, type of haematuria, sex, smoking, hospitals, and countries. Results Of the 11 059 patients assessed for eligibility, 10 896 were included from 110 hospitals across 26 countries. The overall adjusted cancer prevalence (n = 2257) was 28.2% (95% confidence interval [CI] 22.3–34.1), bladder cancer (n = 1951) 24.7% (95% CI 19.1–30.2), UTUC (n = 128) 1.14% (95% CI 0.77–1.52), renal cancer (n = 107) 1.05% (95% CI 0.80–1.29), and prostate cancer (n = 124) 1.75% (95% CI 1.32–2.18). The odds ratios for patient risk markers in the model for all cancers were: age 1.04 (95% CI 1.03–1.05; P < 0.001), visible haematuria 3.47 (95% CI 2.90–4.15; P < 0.001), male sex 1.30 (95% CI 1.14–1.50; P < 0.001), and smoking 2.70 (95% CI 2.30–3.18; P < 0.001). Conclusions A better understanding of cancer prevalence across an international population is required to inform clinical guidelines. We are the first to report urinary tract cancer prevalence across an international population in patients referred to secondary care, adjusted for patient risk markers and geographical variation. Bladder cancer was the most prevalent disease. Visible haematuria was the strongest predictor for urinary tract cancer

    Clutter Noise Removal in Binary Document Images

    No full text
    The paper presents a clutter detection and removal algorithm for complex document images. The distance transform based approach is independent of clutter’s position, size, shape and connectivity with text. Features are based on a residual image obtained by analysis of the distance transform and clutter elements, if present, are identified with an SVM classifier. Removal is restrictive, so text attached to the clutter is not deleted in the process. The method was tested on a collection of degraded and noisy, machine-printed and handwritten Arabic and English text documents. Results show pixel-level accuracies of 97.5 % and 95 % for clutter detection and removal respectively. This approach was also extended with a noise detection and removal model for documents having a mix of clutter and salt-n-pepper noise

    Voronoi++: A Dynamic Page Segmentation approach based on Voronoi and Docstrum Features

    No full text
    This paper presents a dynamic approach to document page segmentation. Current page segmentation algorithms lack the ability to dynamically adapt local variations in the size, orientation and distance of components within a page. Our approach builds upon one of the best algorithms, Kise et. al. work based on Area Voronoi Diagrams [10], which adapts globally to page content to determine algorithm parameters. In our approach, local thresholds are determined dynamically based on parabolic relations between components, and Docstrum based angular and neighborhood features are integrated to improve accuracy. Zone-based evaluation was performed on four sets of printed and handwritten documents in English and Arabic scripts and an increase of 33 % in accuracy is reported
    corecore